In my last deliverable my conclusion and hypothesis did not stand when I tried to determine if high scoring means your on a ranked team. My data proved my models to be inaccurate. I thought my problem was that I didn’t have enough variance in my data. So instead of using just the top 25 ranked teams I added most of the ranked teams in the NCAA.
This dataset that I web scraped contains most of the ranked teams in the NCAA. With this I’m hoping for better results because the variance in player stats in top teams is typically much smaller than the difference in stats between good and bad teams. Along with adding a third dataset I tidyed it by lining it up with my first dataset so it will be able to merge on to it. I used gsub to take abbreviations out of the third dataset so they would align with the first dataset which contain no abbreviations. After that I merged both datasets by their school name.
all_teams <- read_html("https://www.ncaa.com/rankings/basketball-men/d1/ncaa-mens-basketball-net-rankings")
rank <- all_teams %>%
html_nodes("tbody") %>%
html_nodes("tr")
Rank <- rank %>%
html_nodes("td:first_child") %>%
html_text() %>%
as.integer()
School <- rank %>%
html_nodes("td:nth_child(3)") %>%
html_text()
Record <- rank %>%
html_nodes("td:nth_child(5)") %>%
html_text()
list_of_schools <- cbind.data.frame(Rank=Rank, School=School, Record=Record)
list_of_schools$School <- gsub("St\\.", "State", list_of_schools$School)
list_of_schools$School <- gsub("Val\\.", "Valley State", list_of_schools$School)
list_of_schools$School <- gsub("Cal State\\.", "CSU", list_of_schools$School)
Allteams <- merge(x=NCAAData, y=list_of_schools, by="School", all.x=TRUE)
First dataset. All NCAA Data.
(Top25)
Second Dataset. Rankings of top 25 teams.
(Rankings)
Third dataset. List of most NCAA teams added.
(list_of_schools)
This view shows the rankings of the top 20 and bottom 20 teams in the dataset to show the addition of ranks for lower ranked teams. With the 300+ teams added to this dataset I’m showing the potential higher variance and opportunity my conclusion can have with these teams added in my model.
filter(list_of_schools, Rank<=20 | Rank>=333) %>%
ggplot(aes(Rank, School)) +
geom_smooth() +
geom_point(aes(School = School), colour = 'blue', size = 1)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Here I use use my third dataset and test my new model and see if it will yield better results. In this model I used cross-validation methods by incorperating RMSE and MAE test samples to quantify a perdiction error. I also tried calculating the prediction error rate by dividing the RMSE by the average value of the outcome variable. After doing this the model tells us that the varibles are not a good indicator for accurate predictions.
set.seed(385)
top <- filter(Allteams, !is.na(Allteams$Rank))
sample_selection <- top$Rank %>%
createDataPartition(p=0.75, list=FALSE)
train <- top[sample_selection, ]
test <- top[-sample_selection, ]
train_model <- lm(Rank ~ Minutes_Played + Points + Total_Rebounds + Assists + Steals + Field_Goal_Average, data=top)
summary(train_model)
##
## Call:
## lm(formula = Rank ~ Minutes_Played + Points + Total_Rebounds +
## Assists + Steals + Field_Goal_Average, data = top)
##
## Residuals:
## Min 1Q Median 3Q Max
## -216.136 -87.687 -6.046 83.263 224.042
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 210.570199 5.400719 38.989 < 2e-16 ***
## Minutes_Played -0.003087 0.001183 -2.609 0.009104 **
## Points -1.930554 0.356298 -5.418 6.17e-08 ***
## Total_Rebounds -2.372849 0.313292 -7.574 3.97e-14 ***
## Assists -4.601634 0.619575 -7.427 1.21e-13 ***
## Steals 4.737805 1.403096 3.377 0.000737 ***
## Field_Goal_Average 1.348940 0.451283 2.989 0.002805 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 101.1 on 9123 degrees of freedom
## Multiple R-squared: 0.02349, Adjusted R-squared: 0.02285
## F-statistic: 36.57 on 6 and 9123 DF, p-value: < 2.2e-16
prediction <- train_model %>% predict(test)
R2(prediction, test$Rank)
## [1] 0.02378007
data.frame( R2 = R2(prediction, test$Steals),
RMSE = RMSE(prediction, test$Steals),
MAE = MAE(prediction, test$Steals))
RMSE(prediction, test$Steals)/mean(test$Steals)
## [1] 91.44357
Here I tried K-fold cross validation to see if anyone of my variables was a good predictor. It turns out none of them are good. I was able to increase my R-value a little bit by adding more variables but it was still not good enough to make a solid perdiction. This tells us that overall my variables are not a good indicator for determining if your on a good team or not.
set.seed(123)
train.control <- trainControl(method = "repeatedcv",
number = 10, repeats = 3)
model <- train(Rank ~ Win_Shares_per40_Minutes + Defensive_Rating + Player_Efficiency_Rating + Assist_Percentage + Field_Goal + Points + True_Shooting_Percentage + Free_Throw_Rate + Block_Percentage + Win_Shares + Total_Rebound_Percentage + Total_Rebounds + Assists, data = top, method = "lm",
trControl = train.control)
print(model)
## Linear Regression
##
## 9130 samples
## 13 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 8216, 8218, 8217, 8216, 8216, 8219, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 94.02274 0.1556895 79.7215
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
In conclusion it appears that my new model and cross-validation methods did not yield beter results and did not confirm my hypothesis. It increased my R value a little but it was still not enough to back up my hypothesis. I believe this is becuase in my second dataset it included the teams total points, whereas my third dataset did not. In order to operationalize it I would have to obtain total team points for ranked teams in my third dataset which I could not find. I could not find a dataset that contain all the other ranked teams along with their total points. I needed more data in order to prove my hypothesis, data such as total points for all the other teams could have proved it. I was hoping if my data was proved to be useful this could help out scouts of the NBA to determine if a player is good enough or has potential for the next level of basketball which would be the NBA.